feat(streaming): support locality enforcement and locality backfill #23275
Conversation
Pull Request Overview
This PR introduces locality enforcement and locality backfill to RisingWave's streaming engine, extending the existing index selection functionality to improve performance during large historical data backfilling in memory-limited scenarios.
Key changes include:
- Implementation of LocalityProvider operators that buffer data with locality column ordering during backfill
- Extension of dependency ordering to ensure proper sequencing between scan backfill and locality backfill operations
- Addition of session configuration to enable/disable locality backfill functionality
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/stream/src/executor/locality_provider.rs | Core executor implementation for locality-aware backfilling with proper state management |
| src/stream/src/from_proto/locality_provider.rs | Protocol buffer conversion and executor builder for LocalityProvider |
| src/frontend/src/optimizer/plan_node/stream_locality_provider.rs | Stream plan node implementation with state and progress table catalog building |
| src/frontend/src/optimizer/plan_node/logical_locality_provider.rs | Logical plan node with column pruning and predicate pushdown support |
| src/meta/src/stream/stream_graph/fragment.rs | Fragment dependency analysis for LocalityProvider ordering |
| src/meta/src/model/stream.rs | Backfill upstream type classification for LocalityProvider |
| src/common/src/session_config/mod.rs | Session configuration parameter for enabling locality backfill |
| proto/stream_plan.proto | Protocol buffer definition for LocalityProviderNode |
Comments suppressed due to low confidence (1)
src/stream/src/executor/locality_provider.rs:1
- The magic number 1024 should be defined as a named constant or made configurable to improve maintainability and allow tuning.
// Copyright 2025 RisingWave Labs
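A minimal sketch of what this suggestion might look like; the constant name, its placement, and the `DataChunkBuilder` call site are illustrative assumptions, not the actual code in `locality_provider.rs`:

```rust
// Hypothetical refactor sketch: pull the literal 1024 into a named constant so the
// buffer/chunk size is documented in one place and easy to tune later.
const LOCALITY_BACKFILL_CHUNK_SIZE: usize = 1024;

// ...then reference the constant instead of the bare literal at the call site, e.g.:
// let chunk_builder = DataChunkBuilder::new(data_types, LOCALITY_BACKFILL_CHUNK_SIZE);
```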
```rust
}
}

// TODO: truncate the state table after backfill.
```
Copilot AI · Sep 23, 2025
This TODO comment indicates incomplete functionality. State table truncation after backfill should be implemented or tracked in a proper issue management system.
Suggested change:
```diff
- // TODO: truncate the state table after backfill.
+ // Truncate the state table after backfill to free resources.
+ state_table.truncate().await?;
```
This PR is ready to review.
Very cool, I'll take a look tomorrow!
This pull request introduces a new "locality backfill" feature for streaming queries, which enables more efficient backfilling by grouping and buffering input data using specified locality columns. The changes span the SQL test suite, session configuration, planner, protobuf definitions, and catalog metadata to support this feature end-to-end. The most important changes are grouped below:
- Locality Backfill Feature Implementation
- Configuration and Test Coverage
- Catalog and Utility Updates

These changes collectively enable more granular and efficient backfill operations in streaming queries, controlled by a session-level configuration and fully integrated into the query planner and execution engine.
I'm wondering if it's possible to split this PR into smaller ones to make the review process easier?
Looks great to me! Reviewed all files except `stream`.
src/frontend/src/optimizer/plan_node/generic/locality_provider.rs (outdated; resolved)
```rust
Some(
    LogicalLocalityProvider::new(
        self.clone_with_left_right(self.left(), self.right()).into(),
        columns.to_owned(),
    )
    .into(),
)
```
Just realized that we don't generate a `LocalityProvider` until we encounter an upstream with a different locality. If I'm understanding correctly, we will generate a plan like
Join -> LocalityProvider -> Filter -> Project -> Agg
instead of
Join -> Filter -> Project -> LocalityProvider -> Agg
Does it mean that we need to buffer unnecessary data that could have been filtered out by those stateless executors?
Good insight. To resolve this issue, I think we should add the LocalityProvider to the requirement side instead of the provider side.
```rust
if let Some(better_plan) = self.try_better_locality_inner(columns) {
    Some(better_plan)
} else if self.ctx().session_ctx().config().enable_locality_backfill() {
    Some(LogicalLocalityProvider::new(self.clone().into(), columns.to_owned()).into())
```
This pattern seems to appear multiple times. Is there any way to extract it?
Yep. I will refactor this part together with this optimization in the following PR.
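For reference, one possible shape for such a helper, sketched under the assumption that the repeated pattern always wraps the node itself. The method name `locality_or_provider` is hypothetical; `try_better_locality_inner`, `enable_locality_backfill`, and `LogicalLocalityProvider` are taken from the snippet above:

```rust
// Hypothetical helper: try the node's own better-locality rewrite first, and only
// fall back to wrapping the node in a LocalityProvider when the session flag is enabled.
fn locality_or_provider(&self, columns: &[usize]) -> Option<PlanRef> {
    if let Some(better_plan) = self.try_better_locality_inner(columns) {
        Some(better_plan)
    } else if self.ctx().session_ctx().config().enable_locality_backfill() {
        Some(LogicalLocalityProvider::new(self.clone().into(), columns.to_owned()).into())
    } else {
        None
    }
}
```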
```rust
// If the input is hash-distributed, we make it a UpstreamHashShard distribution
// just like a normal table scan. It is used to ensure locality provider is in its own fragment.
```
Does it mean that this is actually a workaround for the backfill order control to work correctly? I guess we can keep the `HashShard` and omit one shuffle if it's solely for correctness.
Not only for backfill order control but also for backfilling itself. Consider that the downstream is a HashJoin and both inputs' distributions are `HashShard` and already satisfy the join key distribution requirement: then in the same fragment we would have 2 backfilling nodes, which is coupled and not scalable.
> Then in the same fragment we would have 2 backfilling nodes, which is coupled and not scalable.
This is true, because we want different tables to be scaled independently regardless of whether there's a downstream job joining them together.
However, for locality backfill nodes, I think it's okay to couple the parallelism and distribution here, as we don't support scaling a specific fragment now (cc @shanicky). If having multiple backfill nodes in a single fragment/actor could be a performance concern, we can also split it using no-shuffle exchanges, which also maintains the existing assumption that there should be at most one backfill node per fragment.
This also seems to introduce unnecessary `Exchange` nodes, which may harm performance even after the backfilling stage is finished... 🤔
src/frontend/src/optimizer/plan_node/stream_locality_provider.rs (outdated; resolved)
```rust
// Set locality columns as primary key.
for locality_col_idx in self.locality_columns() {
    catalog_builder.add_order_column(*locality_col_idx, OrderType::ascending());
```
Just realized that ideally we should also have `Order` as a requirement for "better locality".
But it seems the order doesn't matter; any order can provide the locality we want.
Just some brainstorming: if the downstream is a `MAX` aggregation, then descending order may provide better locality?
IIUC, the ordering here for a `MAX` aggregation is a CPU optimization (like a shortcut that avoids updating the TopN cache) rather than an IO optimization.
If we use `Order` as a requirement for "better locality", we would need to enumerate all possible order combinations on the requirement side (with k locality columns there are k! column orders, times the ascending/descending choices per column), which makes the algorithm's complexity grow very quickly.
```rust
/// Schema: | vnode | pk(locality columns + input stream keys) | `backfill_finished` | `row_count` |
/// Key: | vnode | pk(locality columns + input stream keys) |
fn build_progress_catalog(&self, state: &mut BuildFragmentGraphState) -> TableCatalog {
```
Can we unify this with other backfill nodes?
The table scan backfill progress table schema is not always the same as the locality backfill's: when the `StreamScanType` is `SnapshotBackfill`, the schema is different, so we can't unify them.
risingwave/src/frontend/src/optimizer/plan_node/stream_table_scan.rs, lines 139 to 186 in b4de83b:
```rust
/// Build catalog for backfill state
///
/// When `stream_scan_type` is not `StreamScanType::SnapshotBackfill`:
///
/// Schema: | vnode | pk ... | `backfill_finished` | `row_count` |
///
/// key: | vnode |
/// value: | pk ... | `backfill_finished` | `row_count` |
///
/// When we update the backfill progress,
/// we update it for all vnodes.
///
/// `pk` refers to the upstream pk which we use to track the backfill progress.
///
/// `vnode` is the corresponding vnode of the upstream's distribution key.
/// It should also match the vnode of the backfill executor.
///
/// `backfill_finished` is a boolean which just indicates if backfill is done.
///
/// `row_count` is a count of rows which indicates the # of rows per executor.
/// We used to track this in memory.
/// But for backfill persistence we have to also persist it.
///
/// FIXME(kwannoel):
/// - Across all vnodes, the values are the same.
/// - e.g.
///   | vnode | pk ... | `backfill_finished` | `row_count` |
///   | 1002 | Int64(1) | t | 10 |
///   | 1003 | Int64(1) | t | 10 |
///   | 1003 | Int64(1) | t | 10 |
///
/// Eventually we should track progress per vnode, to support scaling with both mview and
/// the corresponding `no_shuffle_backfill`.
/// However this is not high priority, since we are working on supporting arrangement backfill,
/// which already has this capability.
///
///
/// When `stream_scan_type` is `StreamScanType::SnapshotBackfill`:
///
/// Schema: | vnode | `epoch` | `row_count` | `is_epoch_finished` | pk ...
///
/// key: | vnode |
/// value: | `epoch` | `row_count` | `is_epoch_finished` | pk ...
pub fn build_backfill_state_catalog(
    &self,
    state: &mut BuildFragmentGraphState,
    stream_scan_type: StreamScanType,
) -> TableCatalog {
```
Executor part looks good to me.
To split the PR, I would need to guarantee that everything still works as expected after each part is merged. Considering the code is not that large right now and this is a minimal PR to demonstrate the functionality, I will keep it as is. Any further optimization will be listed in the tracking issue.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
- This PR leverages the `try_better_locality` method to enforce locality for an operator when it can't provide it by itself. The enforcing operator is called `LocalityProvider`. Currently, we only generate this operator for LogicalJoin and LogicalScan. With this ability, users don't even need to create indexes by themselves, but if they know their workload well, they can still create indexes to share them across different jobs.
- `LocalityProvider` operators could depend on each other. We extend the dependency ordering control based on feat(meta,frontend,streaming): support fixed backfill order control #20967.

Performance
TPCH Q18 with scale = 1g
We limit the compute node memory to 2g
Backfill throughput
With locality backfill, our backfilling could finish in 7 min, while without locality backfill, the job is too slow to finish.
Cache miss ratio
With locality backfill, our cache miss ratio is much lower than without it.
Checklist
Documentation
Release note